Dataset - Seoul (Districts), South Korea Air Pollution

Website Data Location: http://kosis.kr/eng/statisticsList/statisticsListIndex.do?menuId=M_01_01&vwcd=MT_ETITLE&parmTabId=M_01_01&statId=1980006&themaId=#T_7.2

Structure:

  • Period – Date of air pollution measurement
  • 25 Seoul Districts – District of Seoul, South Korea

Imports

In [1]:
import numpy as np
import pandas as pd
import requests as req

from scipy import stats
from bs4 import BeautifulSoup as bs

Definitions

This function scrapes the webpage where the data is hosted and parses it out into a pandas dataframe

In [2]:
def scrapeWebpage(dataSoup):
    
    # Find the table that holds the data on the webpage
    dataTable = dataSoup.find("table", {'class': 'js-csv-data csv-data js-file-line-container'})
    
    # Find all the table rows of the data table
    dataTr = dataTable.findAll('tbody')[0].findAll('tr')[1:]
    
    # Parse all table rows into data rows for the dataframe
    dataRows = [[td.get_text().strip().replace('*', '') for td in tr.findAll('td')] for tr in dataTr]
    
    # Create dataframe from the parsed data on the webage
    airData = pd.DataFrame(dataRows, columns=dataRows[0]).drop(index=[0, 1]).reset_index(drop=True).drop(columns=[''])
    
    return airData

This function reformats the date column into a valid date format (month/year)

In [3]:
def reformatDate(airData):
    for i in range(len(airData['Measurement_Date'])):
        
        # Split the date into two parts for reformatting
        dateParts = airData['Measurement_Date'][i].replace('.', '').split(' ')
        
        # Reformat dateinto valid format and insert back into dataframe
        airData['Measurement_Date'][i] = str(int(dateParts[1])) + '/' + dateParts[0]
    
    return airData

This function reshapes the dataframe parsed from the webage to match the format of the CSV data from part 1 of this project

In [4]:
def reshapeDf(airData):
    tempDfList = []
    
    # This loop creates a list of temporary dataframes that each hold the data for each district of Seoul
    for col in airData.keys().drop('Measurement_Date'):
        newDf = pd.DataFrame(airData[['Measurement_Date', col]]).rename(columns={col: 'PM2.5'})
        newDf.insert(1, column='District', value=[col] * len(airData))

        tempDfList.append(newDf)
    
    # Empty dataframe is created with new column names
    tempDf = pd.DataFrame(columns=['Measurement_Date', 'District', 'PM2.5'])
    
    # Loops through the temp dataframe list and appends them all into one dataframe
    airData = tempDf.append([df for df in tempDfList])
    
    # Populate and insert the Country and City columns into the dataframe
    airData.insert(1, column='Country', value=['Republic of Korea'] * len(airData))
    airData.insert(2, column='City', value=['Seoul'] * len(airData))
    
    return airData

This function applies a filter to the reformatted data to remove any outliers that might skew the data. The filter checks to make sure that the data in the data column(s) (PM2.5) is within -3 and +3 standard deviations away from the mean for that column.

In [5]:
def filterData(data):
    # The column(s) that hold numeric data
    dataCols = ['PM2.5']

    # This applies a filter to all the data columns of the dataframe:
    # * For each column, it first computes the Z-score of each value 
    #   in the column relative to the column mean and standard deviation.
    # * If the score is not within -3 and +3 standard deviations away from the mean for that 
    #   column, then the record is filtered out of the dataframe (thus removing the outliers)
    filteredData = data[(np.abs(stats.zscore(data[dataCols])) < 3).all(axis=1)]
    
    # This filter removes any data that is less than zero because
    # the measurement of pollutants in the air cannot go below zero
    filteredData = filteredData[(filteredData[dataCols] >= 0).all(axis=1)]

    print('Total number of rows BEFORE data is removed: {:,}\n Total number of rows AFTER data is removed: {:,}\n'
          '====================================================\n\t       Total number of rows removed: {:>3,}'
          .format(len(data.index), len(filteredData.index), len(data.index) - len(filteredData.index)))
    
    return filteredData

This function calculates the AQI (Air Quality Index) value, typically calculated from PM2.5 and determines its risk level. The AQI formula, value ranges and risk levels were all taken from the EPA (Environmental Protection Agency) of the USA.

In [6]:
def calculate_AQI(airData):
    airData = airData.reset_index()
    
    aqiValues = []
    riskLevel = []

    for i in range(len(airData)):
        i_low = 0
        i_high = 0

        c_low = 0
        c_high = 0

        # PM2.5 AQI Value Calculation
        if 0 <= airData['PM2.5'][i] <= 12:
            c_low = 0
            c_high = 12

            i_low = 0
            i_high = 50
            
        elif 12.1 <= airData['PM2.5'][i] <= 35.4:
            c_low = 12.1
            c_high = 35.4

            i_low = 51
            i_high = 100
            
        elif 35.5 <= airData['PM2.5'][i] <= 55.4:
            c_low = 35.5
            c_high = 55.4

            i_low = 101
            i_high = 150
            
        elif 55.5 <= airData['PM2.5'][i] <= 150.4:
            c_low = 55.5
            c_high = 150.4

            i_low = 151
            i_high = 200
            
        elif 150.5 <= airData['PM2.5'][i] <= 250.4:
            c_low = 150.5
            c_high = 250.4

            i_low = 201
            i_high = 300
            
        elif 250.5 <= airData['PM2.5'][i] <= 350.4:
            c_low = 250.5
            c_high = 350.4

            i_low = 301
            i_high = 400
            
        elif 350.5 <= airData['PM2.5'][i] <= 500.4:
            c_low = 350.5
            c_high = 500.4

            i_low = 401
            i_high = 500
        
        # AQI Formula
        aqiValues += [int(round(((i_high - i_low) / (c_high - c_low)) * 
                                (airData['PM2.5'][i] - c_low) + i_low, 0))]

        # Determine AQI Risk Level
        if 0 <= aqiValues[i] <= 50:
            riskLevel += ['Good']
        elif 51 <= aqiValues[i] <= 100:
            riskLevel += ['Moderate']
        elif 101 <= aqiValues[i] <= 150:
            riskLevel += ['Unhealthy for Sensitive Groups']
        elif 151 <= aqiValues[i] <= 200:
            riskLevel += ['Unhealthy']
        elif 201 <= aqiValues[i] <= 300:
            riskLevel += ['Very Unhealthy']
        elif 301 <= aqiValues[i] <= 500:
            riskLevel += ['Hazardous']
    
    # Add the AQI values to the data frame
    airData['AQI_(PM2.5)'] = aqiValues
    airData['AQI_Risk_Level'] = riskLevel
    
    # Remove index column that was added after the dataframe was reindexed
    airData.drop(['index'], axis=1, inplace=True)
    
    return airData

Website Data Manipulation

Read Website Data

In [7]:
dataSoup = bs(req.get('https://github.com/Robert-Zacchigna/DSC-680_Portfolio/blob/main/'
                      'Air Pollution Analysis and Data Scraping of Seoul (Districts), South Korea/Air_Data/'
                      'Air_Pollution_by_Measurement_Network_PM2.5-(Website_Data).csv').content, features="lxml")

Reformat Website Data

Old Website Data

In [8]:
airData = scrapeWebpage(dataSoup).rename(columns={'Period': 'Measurement_Date'})

airData.head()
Out[8]:
Measurement_Date Jung-gu Jongno-gu Jungnang-gu Dongdaemun-gu Eunpyeong-gu Mapo-gu Yeongdeungpo-gu Dongjak-gu Gwanak-gu ... Gwangjin-gu Seongdong-gu Dobong-gu Seodaemun-gu Guro-gu Seocho-gu Gangseo-gu Songpa-gu Yangcheon-gu Nowon-gu
0 2017. 01 34 34 35 33 30 38 32 34 34 ... 36 31 30 29 32 27 31 29 30 34
1 2017. 02 29 29 30 27 25 32 28 28 30 ... 31 28 26 26 28 27 27 28 32 28
2 2017. 03 38 40 38 34 36 50 38 37 41 ... 44 45 34 37 37 44 36 40 44 37
3 2017. 04 26 26 26 22 25 33 24 26 27 ... 25 28 25 28 25 28 25 24 31 26
4 2017. 05 22 23 27 21 26 28 21 23 28 ... 23 27 25 27 24 24 24 26 29 25

5 rows × 26 columns

In [9]:
print('Initial size of data: {:,} columns and {:,} rows'.format(airData.shape[1], airData.shape[0]))
Initial size of data: 26 columns and 35 rows

New Reformatted Website Data

In [10]:
airData = reformatDate(airData)

airData.head()
Out[10]:
Measurement_Date Jung-gu Jongno-gu Jungnang-gu Dongdaemun-gu Eunpyeong-gu Mapo-gu Yeongdeungpo-gu Dongjak-gu Gwanak-gu ... Gwangjin-gu Seongdong-gu Dobong-gu Seodaemun-gu Guro-gu Seocho-gu Gangseo-gu Songpa-gu Yangcheon-gu Nowon-gu
0 1/2017 34 34 35 33 30 38 32 34 34 ... 36 31 30 29 32 27 31 29 30 34
1 2/2017 29 29 30 27 25 32 28 28 30 ... 31 28 26 26 28 27 27 28 32 28
2 3/2017 38 40 38 34 36 50 38 37 41 ... 44 45 34 37 37 44 36 40 44 37
3 4/2017 26 26 26 22 25 33 24 26 27 ... 25 28 25 28 25 28 25 24 31 26
4 5/2017 22 23 27 21 26 28 21 23 28 ... 23 27 25 27 24 24 24 26 29 25

5 rows × 26 columns

In [11]:
airData = reshapeDf(airData)

airData.head()
Out[11]:
Measurement_Date Country City District PM2.5
0 1/2017 Republic of Korea Seoul Jung-gu 34
1 2/2017 Republic of Korea Seoul Jung-gu 29
2 3/2017 Republic of Korea Seoul Jung-gu 38
3 4/2017 Republic of Korea Seoul Jung-gu 26
4 5/2017 Republic of Korea Seoul Jung-gu 22
In [12]:
print('Size of data after reformatting: {:,} columns and {:,} rows'.format(airData.shape[1], airData.shape[0]))
Size of data after reformatting: 5 columns and 875 rows

Remove Outliers and Invalid Values from Website Data

In [13]:
airData = airData[airData['PM2.5'] != '-']

airData['PM2.5'] = pd.to_numeric(airData['PM2.5'])

filteredData = filterData(airData)
Total number of rows BEFORE data is removed: 874
 Total number of rows AFTER data is removed: 872
====================================================
	       Total number of rows removed:   2

Calculate AQI Data

In [14]:
airData = calculate_AQI(filteredData)

# Two new columns added: AQI_(PM2.5) and AQI_Risk_Level
print('New columns: ' + ', '.join(list(airData.keys())))
New columns: Measurement_Date, Country, City, District, PM2.5, AQI_(PM2.5), AQI_Risk_Level
In [15]:
airData.head()
Out[15]:
Measurement_Date Country City District PM2.5 AQI_(PM2.5) AQI_Risk_Level
0 1/2017 Republic of Korea Seoul Jung-gu 34 97 Moderate
1 2/2017 Republic of Korea Seoul Jung-gu 29 87 Moderate
2 3/2017 Republic of Korea Seoul Jung-gu 38 107 Unhealthy for Sensitive Groups
3 4/2017 Republic of Korea Seoul Jung-gu 26 80 Moderate
4 5/2017 Republic of Korea Seoul Jung-gu 22 72 Moderate

Output Reformed Website Data to CSV

In [16]:
print('Final size of data: {:,} columns and {:,} rows'.format(airData.shape[1], airData.shape[0]))
Final size of data: 7 columns and 872 rows
In [17]:
airData.to_csv('Reformed_Data/Website-Air_Pollution_Data-(Reformed_and_AQI_Values).csv', index=False)
In [ ]: